Introduction

Try to compare different algorithms on pure single cell results. Both the 10x and C1 SmartSeq sections are using the 81,881 entry ENCODE4 bulk annotation set.

Gene and Transcripts

Cell ranger and STAR Solo only produce Gene level count quantifications. When processing 10x data kallisto and salmon alevin produce gene counts as well.

When processing C1 data, Kallisto & Salmon only produce transcript quantifications, so for C1 gene level comparisons, transcript level quantifications are created by grouping all the transcripts by their gene id and summing those groups.

Correlations Used

This notebook reuses the correlation method developed for the ENCODE3 Bulk RNA pipeline

  • Naive Spearman feeds all sample rows in raw data space to scipy's spearmanr function
  • Rafa Spearman is a filtered spearman correlation that requires both values to be expressed (non-zero) and the average expression between the two samples to be > 0 in $log_2$. (So at least > 1 in the raw data space)

My implementation replicate_scores.

Original R version MAD.R

In [9]:
kallisto_em_common = scanpy.read_h5ad(analysis_dir / 'kallisto_em_filtered.h5ad')
kallisto_em_common
Out[9]:
AnnData object with n_obs × n_vars = 6287 × 81881 
    var: 'gene_symbols'
In [12]:
for k in dense_10x:
    print(k, dense_10x[k].shape)
cellranger (81881, 6287)
solo (81881, 6287)
alevin (81881, 6287)
kallisto (81881, 6287)
kallisto em (81881, 6287)

Scatter plot comparisons

I have several slides where I wanted to get a sense of how the different algorithms performed on a cell.

What I settled on is a panel of 3 scatter plots that show the cell with the lowest correlation, the median correlation, and the best correlation between two selected algorithms.

Data points are in blue, spike-in values are in red and the line y=x is in green.

For 10x data there were no spike ins added, so all the spikes should be at 0,0 for naive, or min_threshold = $log_2 0.01$. Once in a while there's a read spuriously maps to the spike ins and so there might be one non-zero spike in the 10x data.

10x e10.5 full annotation

These plots are derived from ENCSR874BOF our e10.5 mouse forelimb dataset. They were analyzed with indicies built from the full ENCODE Bulk RNA-seq GTF file. This is unusual, many other 10x comparisons are done with a much smaller set that focuses on protein coding, lincRNA and antisense genes as well as the TR and IG annotations. A quick inspection suggests processed transcripts and pseudogenes appear to be the largest categories removed. See compare-10x-vs-ENCODE-gtf for more details.

The 10x cells selected were the intersection of Cellranger, STAR Solo, and Alevin methods, which happens to be the cells selected by STAR Solo.

10x full annotation naive spearman count correlation histogram (50 bins)

In [43]:
plot_cell_correlation_histogram(
    tenx_correlations, 
    'naive_spearman', 
    'Histogram of 10x per-cell {metric} correlations', 
    bins=50,
    programs=['cellranger', 'solo', 'alevin', 'kallisto em']
)
In [44]:
plot_cell_correlation_histogram(
    tenx_correlations, 
    'naive_spearman', 
    'Histogram of 10x per-cell {metric} correlations', 
    bins=50,
    programs=['kallisto', 'kallisto em']
)
In [22]:
show_scatter_extremes(dense_10x, tenx_correlations, 'naive_spearman', 'cellranger', 'solo')
In [23]:
show_scatter_extremes(dense_10x, tenx_correlations, 'naive_spearman', 'cellranger', 'alevin')
In [24]:
show_scatter_extremes(dense_10x, tenx_correlations, 'naive_spearman', 'cellranger', 'kallisto')

10x naive spearman solo vs alevin

In [45]:
show_scatter_extremes(dense_10x, tenx_correlations, 'naive_spearman', 'solo', 'alevin')

10x spearman solo vs kallisto

In [46]:
show_scatter_extremes(dense_10x, tenx_correlations, 'naive_spearman', 'solo', 'kallisto')

10x spearman alevin vs kallisto

In [47]:
show_scatter_extremes(dense_10x, tenx_correlations, 'naive_spearman', 'alevin', 'kallisto')

10x rafa filtered count correlation histogram

In [128]:
plot_cell_correlation_histogram(
    tenx_correlations, 
    'rafa_spearman',
    'Histogram of 10x per-cell {metric} correlations', 
    bins=50,
    programs=['cellranger', 'solo', 'alevin', 'kallisto em']
)
In [49]:
show_scatter_extremes(dense_10x, tenx_correlations, 'rafa_spearman', 'cellranger', 'solo')
In [50]:
show_scatter_extremes(dense_10x, tenx_correlations, 'rafa_spearman', 'cellranger', 'alevin')
In [51]:
show_scatter_extremes(dense_10x, tenx_correlations, 'rafa_spearman', 'cellranger', 'kallisto')
In [52]:
show_scatter_extremes(dense_10x, tenx_correlations, 'rafa_spearman', 'solo', 'alevin')
In [53]:
show_scatter_extremes(dense_10x, tenx_correlations, 'rafa_spearman', 'solo', 'kallisto')
In [54]:
show_scatter_extremes(dense_10x, tenx_correlations, 'rafa_spearman', 'alevin', 'kallisto')

C1 processed cells

Originally this was also e10.5 but we sequenced our e10.5 C1 cells were also while we were also tuning our spike-in concentrations, and so e10.5 didn't have a reliable spike in concentration. So this analysis is done using all 845 cells that used the final spike-in pool which is a range of time points from e10.5 to P0. Hopefully since we're just compare cell vs cell and don't do any inter cell comparisons this shouldn't impact the analysis.

C1 naive spearman gene count histogram

See compare-kallisto-fragment-and-transcript-length for a deeper analysis. It turns out this version is wrong because the fragment size used was wrong.

In [57]:
plot_cell_correlation_histogram(c1_gene_count_correlations, 'naive_spearman','Histogram of per-cell {metric} C1 count correlations')
In [58]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'kallisto', 'rsem')
In [60]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'kallisto', 'salmon_decoy')

C1 gene counts spearman kallisto vs salmon

In [61]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'kallisto', 'salmon')

C1 gene counts spearman kallisto vs star

In [62]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'kallisto', 'star')

C1 gene counts rsem vs salmon decoy

In [63]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'rsem', 'salmon_decoy')

C1 gene counts rsem vs salmon

In [64]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'rsem', 'salmon')

C1 gene counts spearman rsem vs star

In [65]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'rsem', 'star')

C1 gene counts spearman salmon_decoy vs salmon

In [66]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'salmon_decoy', 'salmon')

C1 gene counts salmon decoy vs star

In [67]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'salmon_decoy', 'star')

C1 gene counts salmon vs star

In [68]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'naive_spearman', 'salmon', 'star')

C1 gene counts rafa spearman histogram

In [69]:
plot_cell_correlation_histogram(c1_gene_count_correlations, 'rafa_spearman','Histogram of per-cell {metric} C1 count correlations')
In [70]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'kallisto', 'rsem')
In [71]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'kallisto', 'salmon_decoy')
In [72]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'kallisto', 'salmon')
In [73]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'kallisto', 'star')
In [74]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'rsem', 'salmon_decoy')
In [75]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'rsem', 'salmon')
In [76]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'rsem', 'star')
In [77]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'salmon_decoy', 'salmon')
In [78]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'salmon_decoy', 'star')
In [79]:
show_scatter_extremes(c1_gene_counts, c1_gene_count_correlations, 'rafa_spearman', 'salmon', 'star')

C1 Gene TPMs

Histogram C1 naive spearman gene TPMs

In [82]:
plot_cell_correlation_histogram(c1_gene_tpm_correlations, 'naive_spearman','Histogram of per-cell {metric} C1 gene TPM correlations', bins=50)
In [83]:
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'naive_spearman', 'kallisto', 'rsem')
In [84]:
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'naive_spearman', 'kallisto', 'salmon_decoy')
In [85]:
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'naive_spearman', 'kallisto', 'salmon')
In [86]:
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'naive_spearman', 'rsem', 'salmon_decoy')
In [87]:
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'naive_spearman', 'rsem', 'salmon')
In [88]:
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'naive_spearman', 'salmon_decoy', 'salmon')

C1 gene TPM rafa spearman histogram

In [89]:
plot_cell_correlation_histogram(c1_gene_tpm_correlations, 'rafa_spearman','Histogram of per-cell {metric} C1 gene TPM correlations', bins=50)
In [90]:
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'rafa_spearman', 'kallisto', 'rsem')
In [91]:
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'rafa_spearman', 'kallisto', 'salmon_decoy')
In [92]:
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'rafa_spearman', 'kallisto', 'salmon')
In [93]:
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'rafa_spearman', 'rsem', 'salmon_decoy')
In [94]:
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'rafa_spearman', 'rsem', 'salmon')
In [95]:
show_scatter_extremes(c1_gene_tpms, c1_gene_tpm_correlations, 'rafa_spearman', 'salmon_decoy', 'salmon')

C1 transcript count naive spearman correlation histogram

In [98]:
plot_cell_correlation_histogram(c1_transcript_count_correlations, 'naive_spearman','Histogram of per-cell {metric} C1 transcript count correlations', bins=50)
In [99]:
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'naive_spearman', 'kallisto', 'rsem')
In [100]:
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'naive_spearman', 'kallisto', 'salmon_decoy')
In [101]:
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'naive_spearman', 'kallisto', 'salmon')
In [102]:
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'naive_spearman', 'rsem', 'salmon_decoy')
In [103]:
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'naive_spearman', 'rsem', 'salmon')
In [104]:
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'naive_spearman', 'salmon_decoy', 'salmon')

C1 transcript count rafa spearman correlation histogram

In [105]:
plot_cell_correlation_histogram(c1_transcript_count_correlations, 'rafa_spearman','Histogram of per-cell {metric} C1 transcript count correlations', bins=50)
In [106]:
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'rafa_spearman', 'kallisto', 'rsem')
In [107]:
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'rafa_spearman', 'kallisto', 'salmon_decoy')
In [108]:
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'rafa_spearman', 'kallisto', 'salmon')
In [109]:
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'rafa_spearman', 'rsem', 'salmon_decoy')
In [110]:
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'rafa_spearman', 'rsem', 'salmon')
In [111]:
show_scatter_extremes(c1_transcript_counts, c1_transcript_count_correlations, 'rafa_spearman', 'salmon_decoy', 'salmon')

C1 transcript TPM naive spearman correlation histogram

In [114]:
plot_cell_correlation_histogram(c1_transcript_tpm_correlations, 'naive_spearman','Histogram of per-cell {metric} C1 transcript TPM correlations', bins=50)
In [115]:
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'naive_spearman', 'kallisto', 'rsem')
In [116]:
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'naive_spearman', 'kallisto', 'salmon_decoy')
In [117]:
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'naive_spearman', 'kallisto', 'salmon')
In [118]:
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'naive_spearman', 'rsem', 'salmon_decoy')
In [119]:
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'naive_spearman', 'rsem', 'salmon')
In [120]:
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'naive_spearman', 'salmon_decoy', 'salmon')

C1 transcript tpm rafa spearman correlation histogram

In [121]:
plot_cell_correlation_histogram(c1_transcript_tpm_correlations, 'rafa_spearman','Histogram of per-cell {metric} C1 transcript TPM correlations', bins=50)
In [122]:
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'rafa_spearman', 'kallisto', 'rsem')
In [123]:
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'rafa_spearman', 'kallisto', 'salmon_decoy')
In [124]:
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'rafa_spearman', 'kallisto', 'salmon')
In [125]:
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'rafa_spearman', 'rsem', 'salmon_decoy')
In [126]:
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'rafa_spearman', 'rsem', 'salmon')
In [127]:
show_scatter_extremes(c1_transcript_tpms, c1_transcript_tpm_correlations, 'rafa_spearman', 'salmon_decoy', 'salmon')
In [ ]:
 
In [ ]: